- Session 1: How to think about data, and getting started plotting with the grammar of graphics
- Session 2: Wrangling data into shape, useful practices, cognitive perception
- Session 3: Advanced graphics, layering, using maps
NZSA/ORSNZ 2016
tidyr, dplyr%>%Pipes allow the code to be read like a sequence of operations
student <- readRDS("../data/student_sub.rds")
student %>% group_by(ST04Q01) %>%
summarise(m=mean(PV1MATH), s=sd(PV1MATH))
#> # A tibble: 2 × 3
#> ST04Q01 m s
#> <dbl> <dbl> <dbl>
#> 1 1 484 100
#> 2 2 495 105
What are the variables?
| Inst | AvNumPubs | AvNumCits | PctCompletion |
|---|---|---|---|
| ARIZONA STATE UNIVERSITY | 0.90 | 1.57 | 32 |
| AUBURN UNIVERSITY | 0.79 | 0.64 | 44 |
| BOSTON COLLEGE | 0.51 | 1.03 | 47 |
| BOSTON UNIVERSITY | 0.49 | 2.66 | 34 |
| BRANDEIS UNIVERSITY | 0.30 | 3.03 | 49 |
| BROWN UNIVERSITY | 0.84 | 2.31 | 55 |
What's in the column names of this data? What are the experimental units? What are the measured variables?
| id | WI-6.R1 | WI-6.R2 | WI-6.R4 | WM-6.R1 | WM-6.R2 | WI-12.R1 | WI-12.R2 | WI-12.R4 | WM-12.R1 | WM-12.R2 | WM-12.R4 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gene 1 | 2.2 | 2.20 | 4.2 | 2.63 | 5.1 | 4.5 | 5.5 | 4.4 | 3.9 | 4.2 | 3.73 |
| Gene 2 | 1.5 | 0.59 | 1.9 | 0.52 | 2.9 | 1.4 | 3.0 | 1.3 | 1.2 | 1.2 | 0.89 |
| Gene 3 | 2.0 | 0.87 | 3.3 | 0.53 | 4.6 | 2.2 | 5.6 | 2.5 | 2.5 | 3.0 | 1.35 |
What are the variables? What are the records?
melbtemp <- read.fwf("../data/ASN00086282.dly",
c(11, 4, 2, 4, rep(c(5, 1, 1, 1), 31)), fill=T)
kable(head(melbtemp[,c(1,2,3,4,seq(5,128,4))]))
| V1 | V2 | V3 | V4 | V5 | V9 | V13 | V17 | V21 | V25 | V29 | V33 | V37 | V41 | V45 | V49 | V53 | V57 | V61 | V65 | V69 | V73 | V77 | V81 | V85 | V89 | V93 | V97 | V101 | V105 | V109 | V113 | V117 | V121 | V125 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ASN00086282 | 1970 | 7 | TMAX | 141 | 124 | 113 | 123 | 148 | 149 | 139 | 153 | 123 | 108 | 119 | 112 | 126 | 112 | 115 | 133 | 134 | 126 | 104 | 143 | 141 | 134 | 117 | 142 | 158 | 149 | 133 | 143 | 150 | 145 | 115 |
| ASN00086282 | 1970 | 7 | TMIN | 80 | 63 | 36 | 57 | 69 | 47 | 84 | 78 | 49 | 42 | 48 | 56 | 51 | 36 | 44 | 39 | 40 | 58 | 15 | 33 | 51 | 74 | 39 | 66 | 78 | 36 | 61 | 46 | 42 | 63 | 39 |
| ASN00086282 | 1970 | 7 | PRCP | 3 | 30 | 0 | 0 | 36 | 3 | 0 | 0 | 10 | 23 | 3 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 8 | 0 | 18 | 0 | 0 | 0 | 0 | 13 | 3 | 0 | 25 | 0 | 3 |
| ASN00086282 | 1970 | 8 | TMAX | 145 | 128 | 150 | 122 | 109 | 112 | 116 | 142 | 166 | 127 | 117 | 127 | 159 | 143 | 114 | 65 | 113 | 125 | 129 | 147 | 161 | 168 | 178 | 161 | 145 | 142 | 137 | 150 | 120 | 114 | 129 |
| ASN00086282 | 1970 | 8 | TMIN | 50 | 61 | 75 | 67 | 41 | 51 | 48 | -7 | 56 | 62 | 47 | 33 | 67 | 84 | 11 | 41 | 18 | 50 | 22 | 28 | 74 | 94 | 73 | 88 | 50 | 48 | 54 | 78 | 47 | 18 | 39 |
| ASN00086282 | 1970 | 8 | PRCP | 0 | 66 | 0 | 53 | 13 | 3 | 8 | 0 | 0 | 0 | 3 | 5 | 0 | 0 | 64 | 3 | 99 | 36 | 8 | 0 | 0 | 0 | 8 | 36 | 25 | 30 | 56 | 5 | 69 | 3 | 20 |
What are the variables? What are the experimental units?
tb <- read_csv("../data/tb.csv")
#tail(tb)
colnames(tb)
#> [1] "iso2" "year" "m_04" "m_514" "m_014" "m_1524" "m_2534"
#> [8] "m_3544" "m_4554" "m_5564" "m_65" "m_u" "f_04" "f_514"
#> [15] "f_014" "f_1524" "f_2534" "f_3544" "f_4554" "f_5564" "f_65"
#> [22] "f_u"
What are the variables? What are the experimental units?
pew <- read.delim( file = "http://stat405.had.co.nz/data/pew.txt", header = TRUE, stringsAsFactors = FALSE, check.names = F ) kable(pew[1:5, 1:5])
| religion | <$10k | $10-20k | $20-30k | $30-40k |
|---|---|---|---|---|
| Agnostic | 27 | 34 | 60 | 81 |
| Atheist | 12 | 27 | 37 | 52 |
| Buddhist | 27 | 21 | 30 | 34 |
| Catholic | 418 | 617 | 732 | 670 |
| Don’t know/refused | 15 | 14 | 15 | 11 |
10 week sensory experiment, 12 individuals assessed taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?), fried in one of 3 different oils, replicated twice. First few rows:
| time | treatment | subject | rep | potato | buttery | grassy | rancid | painty |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 3 | 1 | 2.9 | 0.0 | 0.0 | 0.0 | 5.5 |
| 1 | 1 | 3 | 2 | 14.0 | 0.0 | 0.0 | 1.1 | 0.0 |
| 1 | 1 | 10 | 1 | 11.0 | 6.4 | 0.0 | 0.0 | 0.0 |
| 1 | 1 | 10 | 2 | 9.9 | 5.9 | 2.9 | 2.2 | 0.0 |
What is the experimental unit? What are the factors of the experiment? What was measured? What do you want to know?
There are various features of messy data that one can observe in practice. Here are some of the more commonly observed patterns.
10 week sensory experiment, 12 individuals assessed taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?), fried in one of 3 different oils, replicated twice. First few rows:
| time | treatment | subject | rep | potato | buttery | grassy | rancid | painty |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 3 | 1 | 2.9 | 0.0 | 0.0 | 0.0 | 5.5 |
| 1 | 1 | 3 | 2 | 14.0 | 0.0 | 0.0 | 1.1 | 0.0 |
| 1 | 1 | 10 | 1 | 11.0 | 6.4 | 0.0 | 0.0 | 0.0 |
| 1 | 1 | 10 | 2 | 9.9 | 5.9 | 2.9 | 2.2 | 0.0 |
| 1 | 1 | 15 | 1 | 1.2 | 0.1 | 0.0 | 1.1 | 5.1 |
| 1 | 1 | 15 | 2 | 8.8 | 3.0 | 3.6 | 1.5 | 2.3 |
Each of these questions involves different summaries of the data.
Keys/Identifiers: - Identify a record (must be unique) - Example: Indices on an random variable - Fixed by design of experiment (known in advance) - May be single or composite (may have one or more variables)
Values/Measures: - Collected during the experiment (not known in advance) - Usually numeric quantities
ff_long <- gather(french_fries, key = variable, value =
rating, potato:painty)
head(ff_long)
#> time treatment subject rep variable rating
#> 1 1 1 3 1 potato 2.9
#> 2 1 1 3 2 potato 14.0
#> 3 1 1 10 1 potato 11.0
#> 4 1 1 10 2 potato 9.9
#> 5 1 1 15 1 potato 1.2
#> 6 1 1 15 2 potato 8.8
In certain applications, we may wish to take a long dataset and convert it to a wide dataset (perhaps displaying in a table).
This is called "spreading" the data.
We use the spread function from tidyr to do this:
french_fries_wide <- spread(ff_long, key = variable,
value = rating)
head(french_fries_wide)
#> time treatment subject rep buttery grassy painty potato rancid
#> 1 1 1 3 1 0.0 0.0 5.5 2.9 0.0
#> 2 1 1 3 2 0.0 0.0 0.0 14.0 1.1
#> 3 1 1 10 1 6.4 0.0 0.0 11.0 0.0
#> 4 1 1 10 2 5.9 2.9 0.0 9.9 2.2
#> 5 1 1 15 1 0.1 0.0 5.1 1.2 1.1
#> 6 1 1 15 2 3.0 3.6 2.3 8.8 1.5
Easiest question to start is whether the ratings are similar on the different scales, potato'y, buttery, grassy, rancid and painty.
We need to gather the data into long form, and make plots facetted by the scale.
ff.m <- french_fries %>% gather(type, rating, -subject, -time, -treatment, -rep) head(ff.m) #> time treatment subject rep type rating #> 1 1 1 3 1 potato 2.9 #> 2 1 1 3 2 potato 14.0 #> 3 1 1 10 1 potato 11.0 #> 4 1 1 10 2 potato 9.9 #> 5 1 1 15 1 potato 1.2 #> 6 1 1 15 2 potato 8.8
ggplot(data=ff.m, aes(x=rating)) + geom_histogram(binwidth=2) + facet_wrap(~type, ncol=5)
ggplot(data=ff.m, aes(x=type, y=rating, fill=type)) + geom_boxplot()
We will start to tackle this by plotting the replicates against each other using a scatterplot.
We need to gather the data into long form, and then get the replicates spread into separate columns.
head(ff.m) #> time treatment subject rep type rating #> 1 1 1 3 1 potato 2.9 #> 2 1 1 3 2 potato 14.0 #> 3 1 1 10 1 potato 11.0 #> 4 1 1 10 2 potato 9.9 #> 5 1 1 15 1 potato 1.2 #> 6 1 1 15 2 potato 8.8 ff.s <- ff.m %>% spread(rep, rating) head(ff.s) #> time treatment subject type 1 2 #> 1 1 1 3 buttery 0.0 0.0 #> 2 1 1 3 grassy 0.0 0.0 #> 3 1 1 3 painty 5.5 0.0 #> 4 1 1 3 potato 2.9 14.0 #> 5 1 1 3 rancid 0.0 1.1 #> 6 1 1 10 buttery 6.4 5.9
ggplot(data=ff.s, aes(x=`1`, y=`2`)) + geom_point() +
theme(aspect.ratio=1) + xlab("Rep 1") + ylab("Rep 2")
ggplot(data=ff.s, aes(x=`1`, y=`2`)) + geom_point() +
theme(aspect.ratio=1) + xlab("Rep 1") + ylab("Rep 2") +
scale_x_sqrt() + scale_y_sqrt()
Make the scatterplots of reps against each other separately for scales, and treatment.
Read in the billboard top 100 music data, which contains N'Sync and Backstreet Boys songs that entered the billboard charts in the year 2000
billboard <- read.csv("../data/billboard.csv")
What's in this data? What's X1-X76?
tidyr to convert this data into a long format appropriate for plotting a time series (date on the x axis, chart position on the y axis)ggplot2 to create this time series plot:dplyr VerbsThe package dplyr helps to make various summaries of the data. There are five primary dplyr verbs, representing distinct data analysis tasks:
(Diagram originally from Hadley Wickham)
dplyrfrench_fries_split <- group_by(ff_long, variable) # SPLIT
french_fries_apply <- summarise(french_fries_split,
rating = mean(rating, na.rm = TRUE)) # APPLY + COMBINE
french_fries_apply
#> # A tibble: 5 × 2
#> variable rating
#> <chr> <dbl>
#> 1 buttery 1.82
#> 2 grassy 0.66
#> 3 painty 2.52
#> 4 potato 6.95
#> 5 rancid 3.85
french_fries %>%
filter(subject == 3, time == 1)
#> time treatment subject rep potato buttery grassy rancid painty
#> 1 1 1 3 1 2.9 0.0 0.0 0.0 5.5
#> 2 1 1 3 2 14.0 0.0 0.0 1.1 0.0
#> 3 1 2 3 1 13.9 0.0 0.0 3.9 0.0
#> 4 1 2 3 2 13.4 0.1 0.0 1.5 0.0
#> 5 1 3 3 1 14.1 0.0 0.0 1.1 0.0
#> 6 1 3 3 2 9.5 0.0 0.6 2.8 0.0
french_fries %>%
arrange(desc(rancid)) %>%
head
#> time treatment subject rep potato buttery grassy rancid painty
#> 1 9 2 51 1 7.3 2.3 0 15 0.1
#> 2 10 1 86 2 0.7 0.0 0 14 13.1
#> 3 5 2 63 1 4.4 0.0 0 14 0.6
#> 4 9 2 63 1 1.8 0.0 0 14 12.3
#> 5 5 2 19 2 5.5 4.7 0 13 4.6
#> 6 4 3 63 1 5.6 0.0 0 13 4.4
french_fries %>%
select(time, treatment, subject, rep, potato) %>%
head
#> time treatment subject rep potato
#> 61 1 1 3 1 2.9
#> 25 1 1 3 2 14.0
#> 62 1 1 10 1 11.0
#> 26 1 1 10 2 9.9
#> 63 1 1 15 1 1.2
#> 27 1 1 15 2 8.8
french_fries %>%
group_by(time, treatment) %>%
summarise(mean_rancid = mean(rancid),
sd_rancid = sd(rancid))
#> Source: local data frame [30 x 4]
#> Groups: time [?]
#>
#> time treatment mean_rancid sd_rancid
#> <fctr> <fctr> <dbl> <dbl>
#> 1 1 1 2.8 3.2
#> 2 1 2 1.7 2.7
#> 3 1 3 2.6 3.2
#> 4 2 1 3.9 4.4
#> 5 2 2 2.1 3.1
#> 6 2 3 2.5 3.4
#> 7 3 1 4.7 3.9
#> 8 3 2 2.9 3.8
#> 9 3 3 3.6 3.6
#> 10 4 1 2.1 2.4
#> # ... with 20 more rows
When the experimental design is packed into column names, we need to extract it, and tidy it up.
genes <- read_csv("../data/genes.csv")
kable(head(genes))
| id | WI-6.R1 | WI-6.R2 | WI-6.R4 | WM-6.R1 | WM-6.R2 | WI-12.R1 | WI-12.R2 | WI-12.R4 | WM-12.R1 | WM-12.R2 | WM-12.R4 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gene 1 | 2.2 | 2.20 | 4.2 | 2.63 | 5.1 | 4.5 | 5.5 | 4.4 | 3.9 | 4.2 | 3.73 |
| Gene 2 | 1.5 | 0.59 | 1.9 | 0.52 | 2.9 | 1.4 | 3.0 | 1.3 | 1.2 | 1.2 | 0.89 |
| Gene 3 | 2.0 | 0.87 | 3.3 | 0.53 | 4.6 | 2.2 | 5.6 | 2.5 | 2.5 | 3.0 | 1.35 |
gather(genes, variable, expr, -id) %>% kable
| id | variable | expr |
|---|---|---|
| Gene 1 | WI-6.R1 | 2.18 |
| Gene 2 | WI-6.R1 | 1.46 |
| Gene 3 | WI-6.R1 | 2.03 |
| Gene 1 | WI-6.R2 | 2.20 |
| Gene 2 | WI-6.R2 | 0.59 |
| Gene 3 | WI-6.R2 | 0.87 |
| Gene 1 | WI-6.R4 | 4.20 |
| Gene 2 | WI-6.R4 | 1.86 |
| Gene 3 | WI-6.R4 | 3.28 |
| Gene 1 | WM-6.R1 | 2.63 |
| Gene 2 | WM-6.R1 | 0.52 |
| Gene 3 | WM-6.R1 | 0.53 |
| Gene 1 | WM-6.R2 | 5.06 |
| Gene 2 | WM-6.R2 | 2.88 |
| Gene 3 | WM-6.R2 | 4.63 |
| Gene 1 | WI-12.R1 | 4.54 |
| Gene 2 | WI-12.R1 | 1.36 |
| Gene 3 | WI-12.R1 | 2.18 |
| Gene 1 | WI-12.R2 | 5.53 |
| Gene 2 | WI-12.R2 | 2.96 |
| Gene 3 | WI-12.R2 | 5.56 |
| Gene 1 | WI-12.R4 | 4.41 |
| Gene 2 | WI-12.R4 | 1.31 |
| Gene 3 | WI-12.R4 | 2.54 |
| Gene 1 | WM-12.R1 | 3.85 |
| Gene 2 | WM-12.R1 | 1.18 |
| Gene 3 | WM-12.R1 | 2.54 |
| Gene 1 | WM-12.R2 | 4.18 |
| Gene 2 | WM-12.R2 | 1.19 |
| Gene 3 | WM-12.R2 | 3.04 |
| Gene 1 | WM-12.R4 | 3.73 |
| Gene 2 | WM-12.R4 | 0.89 |
| Gene 3 | WM-12.R4 | 1.35 |
genes %>%
gather(variable, expr, -id) %>%
separate(variable, c("trt", "leftover"), "-") %>%
kable
| id | trt | leftover | expr |
|---|---|---|---|
| Gene 1 | WI | 6.R1 | 2.18 |
| Gene 2 | WI | 6.R1 | 1.46 |
| Gene 3 | WI | 6.R1 | 2.03 |
| Gene 1 | WI | 6.R2 | 2.20 |
| Gene 2 | WI | 6.R2 | 0.59 |
| Gene 3 | WI | 6.R2 | 0.87 |
| Gene 1 | WI | 6.R4 | 4.20 |
| Gene 2 | WI | 6.R4 | 1.86 |
| Gene 3 | WI | 6.R4 | 3.28 |
| Gene 1 | WM | 6.R1 | 2.63 |
| Gene 2 | WM | 6.R1 | 0.52 |
| Gene 3 | WM | 6.R1 | 0.53 |
| Gene 1 | WM | 6.R2 | 5.06 |
| Gene 2 | WM | 6.R2 | 2.88 |
| Gene 3 | WM | 6.R2 | 4.63 |
| Gene 1 | WI | 12.R1 | 4.54 |
| Gene 2 | WI | 12.R1 | 1.36 |
| Gene 3 | WI | 12.R1 | 2.18 |
| Gene 1 | WI | 12.R2 | 5.53 |
| Gene 2 | WI | 12.R2 | 2.96 |
| Gene 3 | WI | 12.R2 | 5.56 |
| Gene 1 | WI | 12.R4 | 4.41 |
| Gene 2 | WI | 12.R4 | 1.31 |
| Gene 3 | WI | 12.R4 | 2.54 |
| Gene 1 | WM | 12.R1 | 3.85 |
| Gene 2 | WM | 12.R1 | 1.18 |
| Gene 3 | WM | 12.R1 | 2.54 |
| Gene 1 | WM | 12.R2 | 4.18 |
| Gene 2 | WM | 12.R2 | 1.19 |
| Gene 3 | WM | 12.R2 | 3.04 |
| Gene 1 | WM | 12.R4 | 3.73 |
| Gene 2 | WM | 12.R4 | 0.89 |
| Gene 3 | WM | 12.R4 | 1.35 |
genes %>%
gather(variable, expr, -id) %>%
separate(variable, c("trt", "leftover"), "-") %>%
separate(leftover, c("time", "rep"), "\\.") %>% kable
| id | trt | time | rep | expr |
|---|---|---|---|---|
| Gene 1 | WI | 6 | R1 | 2.18 |
| Gene 2 | WI | 6 | R1 | 1.46 |
| Gene 3 | WI | 6 | R1 | 2.03 |
| Gene 1 | WI | 6 | R2 | 2.20 |
| Gene 2 | WI | 6 | R2 | 0.59 |
| Gene 3 | WI | 6 | R2 | 0.87 |
| Gene 1 | WI | 6 | R4 | 4.20 |
| Gene 2 | WI | 6 | R4 | 1.86 |
| Gene 3 | WI | 6 | R4 | 3.28 |
| Gene 1 | WM | 6 | R1 | 2.63 |
| Gene 2 | WM | 6 | R1 | 0.52 |
| Gene 3 | WM | 6 | R1 | 0.53 |
| Gene 1 | WM | 6 | R2 | 5.06 |
| Gene 2 | WM | 6 | R2 | 2.88 |
| Gene 3 | WM | 6 | R2 | 4.63 |
| Gene 1 | WI | 12 | R1 | 4.54 |
| Gene 2 | WI | 12 | R1 | 1.36 |
| Gene 3 | WI | 12 | R1 | 2.18 |
| Gene 1 | WI | 12 | R2 | 5.53 |
| Gene 2 | WI | 12 | R2 | 2.96 |
| Gene 3 | WI | 12 | R2 | 5.56 |
| Gene 1 | WI | 12 | R4 | 4.41 |
| Gene 2 | WI | 12 | R4 | 1.31 |
| Gene 3 | WI | 12 | R4 | 2.54 |
| Gene 1 | WM | 12 | R1 | 3.85 |
| Gene 2 | WM | 12 | R1 | 1.18 |
| Gene 3 | WM | 12 | R1 | 2.54 |
| Gene 1 | WM | 12 | R2 | 4.18 |
| Gene 2 | WM | 12 | R2 | 1.19 |
| Gene 3 | WM | 12 | R2 | 3.04 |
| Gene 1 | WM | 12 | R4 | 3.73 |
| Gene 2 | WM | 12 | R4 | 0.89 |
| Gene 3 | WM | 12 | R4 | 1.35 |
gtidy <- genes %>%
gather(variable, expr, -id) %>%
separate(variable, c("trt", "leftover"), "-") %>%
separate(leftover, c("time", "rep"), "\\.") %>%
mutate(trt = sub("W", "", trt)) %>%
mutate(rep = sub("R", "", rep))
kable(head(gtidy))
| id | trt | time | rep | expr |
|---|---|---|---|---|
| Gene 1 | I | 6 | 1 | 2.18 |
| Gene 2 | I | 6 | 1 | 1.46 |
| Gene 3 | I | 6 | 1 | 2.03 |
| Gene 1 | I | 6 | 2 | 2.20 |
| Gene 2 | I | 6 | 2 | 0.59 |
| Gene 3 | I | 6 | 2 | 0.87 |
gtidy), find the mean expression for each combination of id, trt, and time.melbtemp.m <- melbtemp %>%
select(num_range("V", c(1,2,3,4,seq(5,128,4)))) %>%
filter(V4 %in% c("PRCP", "TMAX", "TMIN")) %>%
gather(day, value, V5:V125, na.rm = TRUE) %>%
spread(V4, value) %>%
mutate(
tmin = as.numeric(TMIN) / 10,
tmax = as.numeric(TMAX) / 10,
t_range = tmax - tmin,
prcp = as.numeric(PRCP) / 10
) %>%
rename(stn=V1, year=V2, month=V3)
kable(head(melbtemp.m))
| stn | year | month | day | PRCP | TMAX | TMIN | tmin | tmax | t_range | prcp |
|---|---|---|---|---|---|---|---|---|---|---|
| ASN00086282 | 1970 | 7 | V101 | 0 | 158 | 78 | 7.8 | 16 | 8.0 | 0.0 |
| ASN00086282 | 1970 | 7 | V105 | 13 | 149 | 36 | 3.6 | 15 | 11.3 | 1.3 |
| ASN00086282 | 1970 | 7 | V109 | 3 | 133 | 61 | 6.1 | 13 | 7.2 | 0.3 |
| ASN00086282 | 1970 | 7 | V113 | 0 | 143 | 46 | 4.6 | 14 | 9.7 | 0.0 |
| ASN00086282 | 1970 | 7 | V117 | 25 | 150 | 42 | 4.2 | 15 | 10.8 | 2.5 |
| ASN00086282 | 1970 | 7 | V121 | 0 | 145 | 63 | 6.3 | 14 | 8.2 | 0.0 |
melbtemp.m$day <- factor(melbtemp.m$day,
levels=c("V5","V9","V13","V17","V21","V25","V29",
"V33","V37","V41","V45","V49","V53","V57",
"V61","V65","V69","V73","V77","V81","V85",
"V89","V93","V97","V101","V105","V109",
"V113","V117","V121","V125"),
labels=1:31)
melbtemp.m$date <- as.Date(paste(melbtemp.m$day,
melbtemp.m$month, melbtemp.m$year, sep="-"),
"%d-%m-%Y")
kable(head(melbtemp.m))
| stn | year | month | day | PRCP | TMAX | TMIN | tmin | tmax | t_range | prcp | date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ASN00086282 | 1970 | 7 | 25 | 0 | 158 | 78 | 7.8 | 16 | 8.0 | 0.0 | 1970-07-25 |
| ASN00086282 | 1970 | 7 | 26 | 13 | 149 | 36 | 3.6 | 15 | 11.3 | 1.3 | 1970-07-26 |
| ASN00086282 | 1970 | 7 | 27 | 3 | 133 | 61 | 6.1 | 13 | 7.2 | 0.3 | 1970-07-27 |
| ASN00086282 | 1970 | 7 | 28 | 0 | 143 | 46 | 4.6 | 14 | 9.7 | 0.0 | 1970-07-28 |
| ASN00086282 | 1970 | 7 | 29 | 25 | 150 | 42 | 4.2 | 15 | 10.8 | 2.5 | 1970-07-29 |
| ASN00086282 | 1970 | 7 | 30 | 0 | 145 | 63 | 6.3 | 14 | 8.2 | 0.0 | 1970-07-30 |
tb_tidy <- tb %>%
gather(demographic, cases, m_04:f_u, na.rm = TRUE) %>%
separate(demographic, c("sex", "age"), "_") %>%
rename(country = iso2) %>%
arrange(country, year, sex, age)
kable(head(tb_tidy))
| country | year | sex | age | cases |
|---|---|---|---|---|
| AD | 1996 | f | 014 | 0 |
| AD | 1996 | f | 1524 | 1 |
| AD | 1996 | f | 2534 | 1 |
| AD | 1996 | f | 3544 | 0 |
| AD | 1996 | f | 4554 | 0 |
| AD | 1996 | f | 5564 | 1 |
Dates are deceptively hard to work with.
Example: 02/05/2012. Is it February 5th, or May 2nd?
Other things are difficult too:
The lubridate, and timeDate package helps tackle some of these issues.
now() #> [1] "2016-11-30 17:47:06 AEDT" today() #> [1] "2016-11-30" now() + hours(4) #> [1] "2016-11-30 21:47:06 AEDT" today() - days(2) #> [1] "2016-11-28"
ymd("2013-05-14")
#> [1] "2013-05-14"
mdy("05/14/2013")
#> [1] "2013-05-14"
dmy("14052013")
#> [1] "2013-05-14"
ymd_hms("2013:05:14 14:5:30", tz = "Australia/Melbourne")
#> [1] "2013-05-14 14:05:30 AEST"
month(ymd("2013-05-14"))
#> [1] 5
year(ymd("2013-05-14"))
#> [1] 2013
wday(ymd("2013-05-14"), label=TRUE, abbr=TRUE)
#> [1] Tues
#> Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat
isWeekday(ymd("2013-05-14"))
#> 2013-05-14
#> TRUE
For the pedestrian sensor data, extract
from the date variable
Can you find the odd one out?
Is it easier now?
internet <- read_csv("../data/internet.csv")
ggplot(data=internet, aes(x=`Social networks`, fill=Gender)) +
geom_bar(position="dodge") +
scale_fill_manual(values=c("Female"="orange", "Male"="darkgreen")) +
facet_wrap(~name, ncol=5) +
theme(legend.position="bottom")
ggplot(data=grad, aes(x=subject, y=AvGREs)) + geom_boxplot() + scale_y_log10()
The date time axis is a little trickier to re-organise, but it can be done.
ggplot(data=rates.sub.m, aes(x=date, y=rate,
colour=currency)) + geom_line() +
scale_x_datetime(breaks = date_breaks("1 month"),
labels = date_format("%b"))
ggplot(data=rates.sub.m, aes(x=date, y=rate,
colour=currency)) + geom_line() +
xlab("Date") + ylab("Standardized rates") +
ggtitle("Cross rates 23/2/2015-11/11/2015")
ggplot(data=rates.sub.m, aes(x=date, y=rate, colour=currency)) +
geom_line() +
xlab(expression(Date[i]^2~ mu ~ pi * sigma)) +
ylab("Standardized rates") +
ggtitle("Cross rates 23/2/2015-11/11/2015")
Save the plot to an R object, p, for re-use
p <- ggplot(data=rates.sub.m, aes(x=date, y=rate,
colour=currency)) + geom_line() +
scale_x_datetime(breaks = date_breaks("1 month"),
labels = date_format("%b"))
p + theme(legend.position = "bottom")
p + theme_tufte()
p + theme_economist()
p + scale_color_brewer("", palette = "Dark2")
clrs <- hue_pal()(3)
p + scale_color_manual("", values=clrs) +
theme(legend.position = "none")
clrs <- dichromat(hue_pal()(3))
p + scale_color_manual("", values=clrs) +
theme(legend.position = "none")
clrs <- brewer.pal(3, "Dark2")
p + scale_color_manual("", values=clrs) +
theme(legend.position = "none")
clrs <- dichromat(brewer.pal(3, "Dark2"))
p + scale_color_manual("", values=clrs) +
theme(legend.position = "none")
Proximity - From the plot can you answer: Is the proportion of girls who use social networks every day (4) higher than boys, in Australia? And is this different in Germany?
Brainstorm with your neighbour ways to rearrange this plot to answer the question.
Then tackle this question: Are German girls more likely to report using social networks once or twice per month (1) than Japanese girls?
What ways would you re-arrange the plot to tackle this one?
Advanced graphics
Notes prepared by Di Cook, building on joint workshops with Carson Sievert, Heike Hofmann, Eric Hare, Hadley Wickham.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.